The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Authors

  • Zhanxing Zhu
  • Jingfeng Wu
  • Bing Yu
  • Lei Wu
  • Jinwen Ma
Abstract

Understanding the generalization of deep learning has attracted much attention recently, and learning algorithms such as stochastic gradient descent (SGD) play an important role in generalization performance. Along this line, we study the anisotropic noise introduced by SGD and investigate its importance for generalization in deep neural networks. Through a thorough empirical analysis, we show that the anisotropic diffusion of SGD tends to follow the curvature of the loss landscape, and is thus effective at escaping from sharp, poor minima toward flatter, more stable ones. We verify this understanding by comparing the anisotropic diffusion with full gradient descent plus isotropic diffusion (i.e., Langevin dynamics) and with other types of position-dependent noise.
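To make the comparison in the abstract concrete, here is a minimal NumPy sketch on a hypothetical 2-D quadratic loss (the Hessian H, step size, and noise scales are illustrative assumptions, not the paper's experimental setup): one update adds curvature-aligned Gaussian noise, mimicking the anisotropic diffusion of minibatch SGD, while the other adds isotropic noise as in Langevin dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D quadratic loss with anisotropic curvature:
# sharp along the first coordinate, flat along the second. (Hypothetical.)
H = np.diag([10.0, 0.1])

def grad(w):
    """Full-batch gradient of the quadratic loss 0.5 * w @ H @ w."""
    return H @ w

def anisotropic_step(w, lr=0.05):
    """Gradient step plus Gaussian noise whose covariance is aligned
    with the curvature, mimicking the diffusion induced by minibatch SGD."""
    noise = rng.multivariate_normal(np.zeros(2), H)
    return w - lr * grad(w) + lr * noise

def langevin_step(w, lr=0.05, sigma=1.0):
    """Gradient step plus isotropic Gaussian noise (Langevin dynamics)."""
    return w - lr * grad(w) + lr * sigma * rng.standard_normal(2)

w_aniso = w_iso = np.array([1.0, 1.0])
for _ in range(500):
    w_aniso = anisotropic_step(w_aniso)
    w_iso = langevin_step(w_iso)

# Curvature-aligned noise concentrates perturbations along the sharp
# direction, so the iterate is pushed out of sharp valleys more readily
# than under isotropic noise of comparable scale.
print("anisotropic:", w_aniso, "isotropic:", w_iso)
```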


Related articles

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Stochastic gradient descent (SGD) is widely believed to perform implicit regularization when used to train deep neural networks, but the precise manner in which this occurs has thus far been elusive. We prove that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term. This potential is however not the original loss function in g...
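The minimization described in this snippet can be sketched as a free-energy-style objective; the notation below is assumed for illustration and is not quoted from the paper:

```latex
% Sketch (assumed notation; requires amsmath): the stationary
% distribution \rho^* of SGD minimizes an average potential \Phi plus
% an entropic regularizer, where \Phi is in general not the original
% training loss.
\[
  \rho^{*} \;=\; \operatorname*{arg\,min}_{\rho}\;
    \mathbb{E}_{x \sim \rho}\bigl[\Phi(x)\bigr] \;-\; \beta^{-1} H(\rho)
\]
% H(\rho) denotes the entropy of \rho; \beta^{-1} sets the noise level.
```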


Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization

Overfitting is one of the most critical challenges in deep neural networks, and various types of regularization methods exist to improve generalization performance. Injecting noise into hidden units during training, e.g., dropout, is known to be a successful regularizer, but it is still not clear why such training techniques work well in practice or how we can maximize their benefit in ...
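As a concrete instance of the noise injection mentioned above, here is a minimal sketch of (inverted) dropout in NumPy; the interface and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    """Inverted dropout: multiplicative Bernoulli noise on hidden units.

    During training, each unit is kept with probability 1 - p and rescaled
    by 1 / (1 - p) so the expected activation is unchanged; at test time
    the layer is the identity. (Illustrative sketch, not the paper's method.)
    """
    if not train:
        return h
    mask = rng.random(h.shape) >= p     # keep mask, True with prob 1 - p
    return h * mask / (1.0 - p)

h = np.ones((4, 3))                     # a batch of hidden activations
print(dropout(h))                       # ~half the units zeroed, rest doubled
```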



Identification of Multiple Input-multiple Output Non-linear System Cement Rotary Kiln using Stochastic Gradient-based Rough-neural Network

Because of the interactions among the variables of a multiple-input multiple-output (MIMO) nonlinear system, its identification is a difficult task, particularly in the presence of uncertainties. The cement rotary kiln (CRK) is a MIMO nonlinear system in a cement factory with a complicated mechanism and uncertain disturbances. Identification of the CRK is very important for different pur...


Forecasting Global Temperature Variations by Neural Networks

Global temperature variations between 1861 and 1984 are forecast using regularization networks, multilayer perceptrons, linear autoregression, and a local model known as the simplex projection method. The simplex projection method is applied to characterize complexities in the time series in terms of the dependence of prediction accuracy on embedding dimension and on prediction-time interval. N...
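Of the methods listed, the linear autoregression baseline is simple enough to sketch; this minimal NumPy least-squares fit (synthetic data and model order are hypothetical, not the 1861-1984 temperature series) illustrates the idea.

```python
import numpy as np

def fit_ar(series, order=3):
    """Least-squares fit of a linear autoregression
    x[t] ~ a[0]*x[t-1] + ... + a[order-1]*x[t-order].
    (Illustrative baseline, not the paper's models or data.)"""
    X = np.column_stack([series[order - 1 - k : len(series) - 1 - k]
                         for k in range(order)])
    y = series[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

# Synthetic demo series (hypothetical).
t = np.arange(200)
series = np.sin(0.1 * t) + 0.1 * np.random.default_rng(0).standard_normal(200)
a = fit_ar(series, order=3)
pred = series[2:-1] * a[0] + series[1:-2] * a[1] + series[:-3] * a[2]
print("AR(3) coefficients:", a)
print("one-step RMSE:", np.sqrt(np.mean((series[3:] - pred) ** 2)))
```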




Publication date: 2018